ABSTRACT

This paper presents a new approach to select events of interest to 
a user in a social media setting where events are generated by the 
activities of the user's friends through their mobile devices. We ar- 
gue that given the unique requirements of the social media setting, 
the problem is best viewed as an inductive learning problem, where 
the goal is to first generalize from the users' expressed "likes" and 
"dislikes" of specific events, then to produce a program that can be 
manipulated by the system and distributed to the collection devices 
to collect only data of interest. 

The key contribution of this paper is a new algorithm that com- 
bines existing machine learning techniques with new program syn- 
thesis technology to learn users' preferences. We show that when 
compared with the more standard approaches, our new algorithm 
provides up to order-of-magnitude reductions in model training time, 
and significantly higher prediction accuracies for our target appli- 
cation. The approach also improves on standard machine learning 
techniques in that it produces clear programs that can be manipulated
to optimize data collection and filtering.

Categories and Subject Descriptors 

H.2.8 [Database Applications]: Data Mining; I.2.2 [Automatic
Programming]: Program synthesis

Keywords 

recommender systems, social networking applications, program syn- 
thesis, support vector machines 

1. INTRODUCTION

At a high level, the problem of selecting events or updates of 
interest to a user in a social media setting appears similar to recom- 
mendation problems in other environments, such as offering book 
or movie recommendations on Amazon and Netflix. In each of
these, a user's previously expressed preferences are used to infer 
new items of interest; every time the user interacts with the site, the 
system builds a more accurate picture of what she likes and dislikes 
and uses it to improve recommendations. Social media, however, 
poses some unique challenges which demand a different approach 
from the standard collaborative filtering, where other users' preferences
are used to infer what the user will like [12, 15].

To illustrate some of the new challenges that recommendation 
systems face in this domain, we focus on an application called Life- 
Join 13). We designed this application to model the future of social 
networking, where a person's profile is continuously updated (mod- 
ulo a privacy filter) by an automatically generated event stream 

1 A shorter version of this paper appeared in CIKM '12.



from the user's mobile devices, including her location and activ- 
ities (e.g., running, sitting on a bus, in a meeting, etc). The sys- 
tem also attempts to discover interesting co-occurrences in friends' 
event streams, such as a meeting of two of the user's friends in a 
nearby pub. In order to deal with the data deluge, the system gives 
the user the ability to "like" and "dislike" both individual and com- 
binations of events. LifeJoin uses the expressed likes and dislikes to 
infer what kinds of events are of interest to the user, which can then 
be used to auto-populate the user's newsfeed or notify her of inter- 
esting nearby social events. Collecting all sorts of events through a 
mobile device can consume a lot of energy [7], so LifeJoin uses the 
inferred user's interest to drive subsequent event acquisition. For 
instance, if LifeJoin infers that Mary's friends are only interested 
in the places she goes for a jog, then the system will save power on 
Mary's device by turning off data collection when she is not jog- 
ging. Our initial experiments have shown that implementing the 
data collection scheme in the scenario above can extend the phone 
battery life by up to 40% [5]. Thus, the more accurately we can
detect the users' real interests, the more energy we can save in data
collection as compared to a scheme that collects all data under all 
circumstances. 

More specifically, inferring interests in LifeJoin poses four unique 
challenges: 

1. Decomposable Models: For applications such as LifeJoin, 
models must be decomposable into simple classifiers that can be 
pushed down to the individual devices to drive event acquisition. 
One simple way to ensure a model is decomposable is to limit it to 
only contain boolean combinations of simple predicates over the in- 
put features, which can be decomposed in a straightforward way to 
indicate the required data from phones. Such models are also use- 
ful because they allow users to give explicit feedback about whether 
the system actually understands their true interests, and to manually 
tune the models to better suit their preferences as discussed in [12].
By contrast, many existing preference learning algorithms produce 
black box classifiers that are difficult to decompose and understand. 

2. Active Learning: Given the large number of incoming events, 
and the large number of ways in which they could be combined, it 
is unreasonable to ask the user to rate any meaningful fraction of 
them. Thus, the learner needs to intelligently choose a subset of 
incoming events that can most improve the current model. In ad- 
dition, the domain of users mentioned in the incoming events can 
also change over time as the user's friends network changes. 

3. Noisy, Skewed Data: Since the ratings are produced by hu- 
mans, they are bound to contain occasional errors. Users also change 
their interests over time, so the same event might be given different 
ratings depending on when it was shown to the user. At the same
time, each user's definition of "interesting" is different, so it is difficult
to make generalizations about statistical properties such
as the anticipated degree of skew in users' preferences. In fact, this
is currently an active research topic on its own [3].

4. Personalized Events: Unlike typical recommendation sys- 
tems such as those for books, movies, or online ads, where all users 
rate a common set of items, the events in LifeJoin tend to be highly 
personalized. For instance, a user might like an event because it 
involves her best friend Peter, but the same event would be totally 
meaningless if it is shown to another user who does not know Peter.
Thus, we believe it is easier to learn a model for each user individu- 
ally (i.e., event is of interest if it involves Peter, without needing to 
know the relationship between Peter and the user) rather than try- 
ing to discover the relationships between users and design a model 
that is applicable to all. 

These requirements preclude the use of collaborative filtering
(CF) techniques that have been successful in other recommendation
systems, such as building neighborhood or latent factor models
to predict user ratings. In particular, these techniques tend to
generate models that can neither drive data acquisition nor be
explained to the user to solicit further feedback
(req 1). For instance, a neighborhood model-based approach might
attribute a new rating based on a set of previously rated events that 
are deemed similar, but it is unclear how the system can easily gen- 
eralize from the set of similar events to determine what new events 
to collect. Furthermore, CF techniques require a similarity measure 
between users or events. It is unclear how that can be done in a set- 
ting where events are highly personalized to a small set of users (req 
4); this is an active research topic [13, 14], and the proposed solutions
require explicitly modeling all social relationships between
users, rather than simply learning a separate model for each user 
individually, which does not require discovering the relationships 
among the users. 

We avoid the above issues by viewing the problem as an induc- 
tive learning problem with an active learning component: given a 
set of labeled examples, the goal is to learn a set of rules that rep- 
resents an individual user's preferences, and to choose new events 
for the user to rate. Unfortunately, standard inductive learning algorithms
such as those based on entropy measures (e.g., decision
trees and inductive logic programming tools) are known to have issues
with skewed data (req 3) [4], it is unclear how active learning can
be applied to them, and they do not provide good generalization guarantees
when compared to statistical learners such as support
vector machines (SVMs).

Recently, the programming languages community has been exploring
inductive learning problems in the context of software synthesis
in programming-by-example systems [9], where the goal is
to infer a program from a set of sample behaviors. Unfortunately,
the learning problem in LifeJoin is different enough that none of
the previous techniques from this community can be applied out 
of the box. In particular, the active learning problem has not been 
sufficiently addressed by previous research from this community. 
Nevertheless, these techniques provide a new set of tools that can 
be leveraged to attack the problem. 

In this paper, we present a new algorithm to infer users' interests; 
the algorithm combines new techniques in program synthesis with 
more traditional machine learning approaches to satisfy the unique 
requirements illustrated by the LifeJoin application. Specifically, 
we make the following contributions: 

1. We show that both the classical machine learning approach 
and an approach based purely on program synthesis do not ade- 
quately address this problem. 

2. We describe a hybrid approach that employs program synthe- 
sis to generate a number of classifying functions, and subsequently 
asks an SVM to assign weights to the features in each generated
function. We show that, when compared to pure machine learning
or synthesis approaches, this hybrid technique takes up to an
order of magnitude less time to encode the training data into a fea- 
ture space representation, and improves upon traditional learning 
algorithms by 30% in overall classification accuracy. 

3. We show that we can use a program synthesizer to produce 
more decomposable and human-understandable models than those 
generated by traditional machine learning techniques, and provide 
empirical evidence that the generated models are comparable to the 
original intentions that the user has in her mind. 

We have implemented the learning technique in the context of 
the LifeJoin application. However, we believe that our approach is 
applicable to other social networking applications as well, where 
large amounts of data are collected from users, and labels provided 
by users contain errors or interest drifts. In the next section we give 
an overview of the various steps in the learning task in LifeJoin, 
and illustrate our approach with an example. 

2. OVERVIEW OF THE APPROACH 

In this section, we illustrate the recommendation problem with a 
concrete example and present an outline of our solution. To frame 
the problem, consider the LifeJoin event stream, which contains 
large numbers of events about the activities of a user's friends and 
family. Out of this event stream, suppose that the user is interested 
in events where her friend Joe is away from home either late at 
night or early in the morning: 

$(\mathit{user} = \mathrm{Joe}) \land (\mathit{location} \neq \mathrm{Home}) \land (\mathit{time} < \text{9am} \lor \mathit{time} > \text{9pm})$

The goal of the system is to infer this interest function based on 
events the user rates as having liked or disliked. We want the al- 
gorithm to produce its interest function in the form of a predicate 
like the one above because that helps ensure the decomposability 
described earlier. When the interest function is expressed in this 
form, it can be easily manipulated and decomposed into predicates 
that can be pushed down to individual users' phones to optimize 
the data acquisition process as described. Such expressions are also 
comprehensible by users, and can be manually adjusted to tune the 
results the user sees. We are not aware of any statistically-based 
methods (such as CF or SVM) that can directly generate models 
like these. 

In the absence of additional information about the expected dis- 
tribution of the events, the most naive approach to finding an in- 
terest function is to exhaustively explore the space of all possible 
predicates of the desired form until a set of predicates is found that 
matches all the previously labeled events. The most obvious problem
with such an approach is that the space of possible predicates is
enormous: on the order of $10^{40}$ in some of our experiments. However,
as we will describe in Sec. 3, new technology from the field
of combinatorial synthesis [23] can find a matching interest function
in this space in a few seconds. For example, Fig. 1 shows a
sample of labeled data and a few interest functions that were found
this way to match the data.

A deeper problem with the naive approach is that predicates 
found this way cannot be expected to have much generalization 
power — that is, they are unlikely to correctly classify as yet unseen 
data items. Individually, they will also not be of much use in opti- 
mally determining the next data element to present to the user for 
labeling. To address this problem, we rely on the idea of boosting
[20]. After the combinatorial synthesis algorithm has found $K$ interest
functions $f_i$, each of these functions can be treated as a weak
base learner, and the group forms an ensemble.

The standard way of forming the ensemble is to learn a linear
function $F(e) = \sum_i w_i \cdot f_i(e)$, where an event is classified as interesting
if $F(e) > 0$. The ensemble allows us to follow a standard

Each line below denotes a potential classifier:

$(\mathit{User} = \mathrm{Joe}) \land (\mathit{location} = \mathrm{Office} \lor \mathit{location} = \mathrm{Bar}) \land (\mathit{time} < \text{7am} \lor \mathit{time} > \text{10pm})$
$(\mathit{User} \neq \mathrm{Bill}) \land (\mathit{time} > \text{10pm} \lor \mathit{location} = \mathrm{Bar})$
$(\mathit{User} = \mathrm{Joe}) \land (\mathit{time} < \text{9am} \lor \mathit{time} > \text{11am})$

Figure 1: Learning example with labeled data (left) and candidate classifiers that are consistent with the labeled data (right)



approach for active learning, namely, to select those events that
are closest to the boundary where $F(e) = 0$ [26]. Normally, the
weights $w_i$ are selected based on the training data, but in our case,
since all the functions $f_i$ were selected to agree on all the training
events, that leads to all functions having equal weight. That
means that the ensemble reduces to a majority vote, and the active
learning strategy reduces to selecting the event that causes the
maximum level of disagreement among all the candidate interest
functions. We refer to this pure synthesis-based algorithm as the
"ensemble" approach. As we will see in Sec. 5, such an approach
already outperforms many standard learning techniques, but we can
do better.
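The ensemble strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LifeJoin implementation: the event fields (`user`, `location`, `hour`), the three candidate functions (loosely modeled on Fig. 1), and the helper names `vote`, `is_interesting`, and `most_contested` are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
InterestFn = Callable[[Event], bool]

# Hypothetical candidate interest functions, loosely modeled on Fig. 1.
# Times are simplified to an integer hour of the day (an assumption).
candidates: List[InterestFn] = [
    lambda e: (e["user"] == "Joe" and e["location"] in ("Office", "Bar")
               and (e["hour"] < 7 or e["hour"] > 22)),
    lambda e: e["user"] != "Bill" and (e["hour"] > 22 or e["location"] == "Bar"),
    lambda e: e["user"] == "Joe" and (e["hour"] < 9 or e["hour"] > 11),
]


def vote(event: Event, fns: List[InterestFn]) -> int:
    """Equal-weight ensemble score: +1 for each 'interesting' vote, -1 otherwise."""
    return sum(1 if f(event) else -1 for f in fns)


def is_interesting(event: Event, fns: List[InterestFn]) -> bool:
    """Majority vote: the event is interesting if the score is positive."""
    return vote(event, fns) > 0


def most_contested(unrated: List[Event], fns: List[InterestFn]) -> Event:
    """Pick the unrated event on which the candidates disagree the most,
    i.e. the one whose score is closest to zero."""
    return min(unrated, key=lambda e: abs(vote(e, fns)))


unrated_events = [
    {"user": "Joe", "location": "Bar", "hour": 23},
    {"user": "Ann", "location": "Bar", "hour": 12},
    {"user": "Bill", "location": "Home", "hour": 12},
]
next_query = most_contested(unrated_events, candidates)
```

In this toy data the candidates agree on the first and third events and split on the second, so the second event is the one that would be shown to the user for rating.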

When defining the space of candidate interest functions, we require
the functions to be in disjunctive normal form. This means
that every function $f_i$ can be seen as a disjunction of individual
predicates $p_{ij}$. We exploit this structure when building the ensemble;
instead of an ensemble $F(e) = \sum_i w_i \cdot f_i(e)$, we build an ensemble
of the form $F'(e) = \sum_{i,j} w_{ij} \cdot p_{ij}(e)$. Finding weights for
each predicate is no longer trivial. We use an SVM to find a set of
weights for the function, which has the additional benefit that the
weights will be set in such a way that the resulting classifier will be
a maximum-margin one. As we will see in Sec. 5, defining the ensemble
in this way significantly improves active learning, and we
call this combination of program synthesis and machine learning
techniques the "hybrid" approach.
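To illustrate the predicate-level ensemble, the sketch below encodes an event as a vector of predicate outcomes and evaluates the weighted sum $F'(e)$. The predicates, the +1/-1 encoding, and the weights are placeholders chosen for the example; in the hybrid approach the weights would come from an SVM, which is not reproduced here.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Predicate = Tuple[str, Callable[[Event], bool]]

# Hypothetical base predicates p_ij pulled out of synthesized functions.
predicates: List[Predicate] = [
    ("user == Joe", lambda e: e["user"] == "Joe"),
    ("location == Bar", lambda e: e["location"] == "Bar"),
    ("hour > 22", lambda e: e["hour"] > 22),
]


def encode(event: Event) -> List[int]:
    """Map an event to a +1/-1 feature vector, one entry per predicate."""
    return [1 if p(event) else -1 for _, p in predicates]


def score(event: Event, weights: List[float]) -> float:
    """F'(e): the weighted sum of the predicate outputs."""
    return sum(w * x for w, x in zip(weights, encode(event)))


# Placeholder weights (assumed); an SVM would choose these from training data.
weights = [0.9, 0.2, 0.6]

event = {"user": "Joe", "location": "Home", "hour": 23}
features = encode(event)
rating = score(event, weights)
```

Because the weights are fractional, the score is a real number rather than a boolean, which is why a further step is needed to recover a well-formed predicate.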

As our experiments show, this approach also copes gracefully
with errors in the training data. We can improve its handling of
errors by selecting each of the $f_i$ functions to match only a randomly
chosen subset of the data. If the rate of errors is low, this
ensures that at least some of the $f_i$ will be selected to match only
uncorrupted data.

One issue that still has to be addressed is that the SVM may 
find fractional values for the weights, so the function F'(e) will 
no longer be a well-formed boolean predicate. Once again, we use 
combinatorial synthesis technology to find a well-formed predicate 
P(e) that is closest to the linear function F'(e). Such predicates 
have the decomposability property we desire. 

Now that we have described our basic approach and problem
setup, we describe the synthesis technology we use to solve the
problem in more detail in Sec. 3, as well as the details of our hybrid
approach in Sec. 4. Sec. 5 presents our experiments on a synthetic
data set derived from the LifeJoin scenario, showing substantial
performance gains for the hybrid approach. Finally, we discuss
related work in Sec. 6 and conclude in Sec. 7.



3. CONSTRAINT BASED SYNTHESIS 

In recent years, there has been a lot of interest in the programming
languages community around constraint-based approaches to
program synthesis [23, 22, 10, 24]. At a high level, this technology
provides an efficient mechanism to search a space of candidate
programs for one whose behavior satisfies a given specification.

The synthesis problem can be seen as a generalization of tra- 
ditional curve fitting, where a space of possible curves — say, the 
space of all polynomials of degree less than k — is explored in search 



of one that satisfies a given set of requirements. Modern synthesis 
systems go several steps beyond simple curve-fitting by providing 
rich languages for describing requirements and spaces of candi- 
date programs. The search for a correct solution in this space is 
performed symbolically; i.e., the space of candidate programs is 
described through a set of equations which are solved through a 
combination of inductive and deductive methods by a specialized 
solver. LifeJoin uses a synthesis system called Sketch [23]. In
the remainder of this section, we give a brief overview of Sketch,
and show how our system uses it to generate candidate solutions to 
the problem. 

3.1 The Sketch Synthesis System at a Glance 

Sketch extends a simple procedural language — think C or Pas- 
cal — with new constructs that allow users to write programs with 
holes, i.e., missing expressions that must be completed by the syn- 
thesizer. The language allows programmers to use recursive defini- 
tions to describe the space of expressions that can be used to fill a 
hole. For example, consider the following program: 

int foo(int x, int y) { return expr(x,y); }

generator int expr(int x, int y) {
    return {| ?? | x | y | expr(x,y) + expr(x,y) |};
}



The program defines an expression expr to be either a constant, a 
variable x or y, or a sum of similar sub-expressions. In essence, the 
generator defines a grammar for the possible expressions that can 
be returned. 

Given the grammar, the user can constrain the behavior of the 
desired expression by writing test harnesses. For example, the test 
harness below ensures that the value returned by the function is 
greater than twice the first parameter when the second parameter 
is greater than zero, and ensures also that when x=5 and y=8 the
function produces 12.

harness void main(int px, int py) {
    if (py > 0) { assert foo(px, py) > 2*px; }
    assert foo(5, 8) == 12;
}

Given such a program as input, the Sketch system discovers that a 
plausible solution is for foo to return x+x+2. 
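To make the example concrete, the sketch below enumerates the expression grammar by brute force and checks each candidate against the harness on a small, bounded range of inputs. It only illustrates what searching the space means: Sketch solves the problem symbolically rather than by enumeration, and the bounds used here (constants 0 to 3, at most three summands, inputs between -8 and 8) are assumptions chosen to keep the search tiny.

```python
from itertools import combinations_with_replacement

# Leaves of the grammar: the two variables or a small constant (assumed range).
TERMS = ["x", "y", 0, 1, 2, 3]


def evaluate(expr, x, y):
    """An expression is a tuple of terms that are summed together."""
    env = {"x": x, "y": y}
    return sum(env[t] if isinstance(t, str) else t for t in expr)


def passes_harness(expr, lo=-8, hi=8):
    """Bounded check of the two assertions in the test harness."""
    if evaluate(expr, 5, 8) != 12:
        return False
    return all(evaluate(expr, px, py) > 2 * px
               for px in range(lo, hi + 1)
               for py in range(1, hi + 1))   # only py > 0 is constrained


def synthesize(max_terms=3):
    """Return the first sum of at most max_terms terms that passes the harness."""
    for n in range(1, max_terms + 1):
        for expr in combinations_with_replacement(TERMS, n):
            if passes_harness(expr):
                return expr
    return None


solution = synthesize()
```

With these bounds the first expression found is the sum of `x`, `x`, and `2`, which matches the solution x+x+2 mentioned above.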

To understand how the technology works, consider the generator 
expr which describes a set of possible expressions. One way to 
understand this generator is that every time it is called, the system 
has to make a choice about what to return. In order to turn foo 
into a concrete piece of code, the system needs a strategy to make 
those choices; i.e., it needs to find a recipe for how to make the 
choices in expr to ensure that the correct answer is produced every 
time. Sketch encodes such a recipe as a vector of bits c, so the test 
harness can be seen as taking c as an additional parameter. The 
goal of finding a strategy that works every time reduces to finding 
a value of c that satisfies an equation of the form 

$$\forall\, px, py.\; P_{\mathit{main}}(px, py, c)$$

The predicate $P_{\mathit{main}}$ is derived from the test harness main automatically
by a compiler, and is true if the strategy c causes the function
to pass the assertions when run with inputs px and py. Sketch
translates the equation above into a series of boolean satisfiability
problems. Unlike traditional inductive learners such as decision
trees (which give poor results, as discussed in Sec. 1 and Sec. 5),
the core algorithm in Sketch works by forming an initial hypothesis
about the solution, then iteratively finding inputs to the harness
on which the hypothesis fails and incorporating them into the hypothesis
itself. The process repeats until the harness is satisfied. The details
of the algorithm are described in [23]. The problem the algorithm solves
is NP-hard, and it uses a SAT solver as the backend. However, in practice
many of the problems, such as those in LifeJoin, can be solved in a very
short time, as our experiments show.
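The refine-on-failure loop described above can be sketched as follows. This is a toy version under strong assumptions: the candidate space is limited to functions of the form a*x + b*y + c with small coefficients (a hypothetical stand-in for the bit vector c), and both the inductive step and the verification step use exhaustive search over a bounded input range instead of a SAT solver.

```python
from itertools import product

# Hypothetical candidate space: foo(x, y) = a*x + b*y + c with small coefficients.
CANDIDATES = list(product(range(3), range(3), range(5)))
# Bounded input domain used for verification (an assumption).
INPUTS = [(px, py) for px in range(-8, 9) for py in range(-8, 9)]


def foo(cand, x, y):
    a, b, c = cand
    return a * x + b * y + c


def harness_ok(cand, px, py):
    """The test harness from the text, run on one concrete input."""
    if py > 0 and not foo(cand, px, py) > 2 * px:
        return False
    return foo(cand, 5, 8) == 12


def cegis():
    examples = []   # failing inputs collected so far
    while True:
        # Inductive step: pick any candidate consistent with the examples.
        cand = next((c for c in CANDIDATES
                     if all(harness_ok(c, px, py) for px, py in examples)), None)
        if cand is None:
            return None, examples          # candidate space exhausted
        # Verification step: look for an input on which the candidate fails.
        cex = next(((px, py) for px, py in INPUTS
                    if not harness_ok(cand, px, py)), None)
        if cex is None:
            return cand, examples          # candidate passes on every input
        examples.append(cex)


result, counterexamples = cegis()
```

In this toy setting the loop needs only two failing inputs before it settles on the coefficients (2, 0, 2), i.e. 2*x + 2, rather than checking every candidate against every input.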

3.2 Encoding the Space of Interests using Sketch 

Given the above, we now discuss how we use Sketch to aid in 
feature selection in LifeJoin. One of the problems with feature 
selection is that there is an exponentially large space of possible 
features, so analyzing them one by one to identify those that bet- 
ter predict the labels in the training data is prohibitively expensive. 
By contrast, constraint-based synthesis allows us to represent the 
entire space of possible interest functions as a compact sketch that 
uses a grammar to describe the space of all possible solutions to the 
classification problem. 

As mentioned in Sec. 2, we would like to generate interest functions
that select data elements from the stream of events collected
from users' phones by returning a boolean value given events from
the event streams. To generate the appropriate interest function,
we encode its grammar in Sketch, in a manner similar to the example
above. LifeJoin currently collects two streams of events from
users' phones: a stream that describes a user's activity (walking,
running, etc.), and another that describes a user's location. Events
in both streams carry the user involved and timestamps that describe
the start and end time of each event. Given that, we
encode the space of interest functions using a grammar with predicates
from the two event streams, as shown in Fig. 2. Each interest
function consists of a disjunction of interests and returns a boolean
value. Each interest takes in an activity and a location event, and
consists of a conjunction of event predicates. Each event predicate
either restricts the set of events from one of the event streams,
or is a join predicate that links events from both streams; for
instance, the user from the location event has to be the same as the
user from the activity event. As an example, a user who is interested
in events about Peter running along the Charles River can be
represented with the interest function:

$a.\mathit{user} = \mathrm{Peter} \land a.\mathit{activity} = \mathrm{running} \land l.\mathit{location} = \text{Charles River} \land a.\mathit{user} = l.\mathit{user}$

We formulated our grammar based on initial user studies, and 
further predicates (such as average duration of events) can be incor- 
porated as needed. In our experiments we also bound the maximum 
number of disjuncts and conjuncts allowed in the interest functions 
and interests during synthesis, along with the set of users, activities, 
and locations. 
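As a rough illustration of the structure described above, an interest function can be represented as a list of interests (a disjunction), each of which is a list of event predicates (a conjunction) over one activity event and one location event. The class and function names below are hypothetical and the timestamps are arbitrary integers; this is a sketch of the data shape, not LifeJoin code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Activity:
    user: str
    activity: str
    start: int
    end: int


@dataclass(frozen=True)
class Location:
    user: str
    location: str
    start: int
    end: int


EventPred = Callable[[Activity, Location], bool]
Interest = List[EventPred]            # conjunction of event predicates
InterestFunction = List[Interest]     # disjunction of interests


def matches(f: InterestFunction, act: Activity, loc: Location) -> bool:
    """True if at least one interest has all of its predicates satisfied."""
    return any(all(p(act, loc) for p in interest) for interest in f)


# The "Peter running along the Charles River" example from the text.
peter_running: InterestFunction = [[
    lambda act, loc: act.user == "Peter",
    lambda act, loc: act.activity == "running",
    lambda act, loc: loc.location == "Charles River",
    lambda act, loc: act.user == loc.user,      # join predicate
]]

run = Activity("Peter", "running", 100, 200)
at_river = Location("Peter", "Charles River", 90, 210)
someone_else = Location("Mary", "Charles River", 90, 210)
```

Here `matches(peter_running, run, at_river)` holds, while pairing the same activity with `someone_else` fails the join predicate.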

Following the example above, given the grammar for interest functions,
we use the previously labeled events from the user as the harness.
We then ask the Sketch system to generate an interest function
that satisfies the labels on the training events. Each function
generated becomes a weak base learner, as discussed in Sec. 2.



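$$
\begin{array}{rcl}
f(a, l) \in \text{interest function} & ::= & \bigvee_{k}\, i(a, l) \\
i(a, l) \in \text{interest} & ::= & \bigwedge_{k}\, \big( ap(a) \mid lp(l) \mid jp(a, l) \big) \\
a \in \text{activity} & ::= & \{\text{user}, \text{activity}, \text{start}, \text{end}\} \\
l \in \text{location} & ::= & \{\text{user}, \text{location}, \text{start}, \text{end}\} \\
ap(a) \in \text{activity pred} & ::= & a.\text{user}\ op\ \{\text{Users}\} \\
 & \mid & a.\text{activity}\ op\ \{\text{Activities}\} \\
 & \mid & a.\text{start}\ op\ \mathbb{N} \mid a.\text{end}\ op\ \mathbb{N} \\
 & \mid & (a.\text{end} - a.\text{start})\ op\ \mathbb{N} \\
lp(l) \in \text{location pred} & ::= & l.\text{user}\ op\ \{\text{Users}\} \\
 & \mid & l.\text{location}\ op\ \{\text{Locations}\} \\
 & \mid & l.\text{start}\ op\ \mathbb{N} \mid l.\text{end}\ op\ \mathbb{N} \\
 & \mid & (l.\text{end} - l.\text{start})\ op\ \mathbb{N} \\
jp(a, l) \in \text{join pred} & ::= & a.\text{user}\ op\ l.\text{user} \mid a.\text{start}\ op\ l.\text{start} \\
 & \mid & a.\text{end}\ op\ l.\text{end} \mid a.\text{start}\ op\ l.\text{end} \\
 & \mid & a.\text{end}\ op\ l.\text{start} \\
 & \mid & (a.\text{end} - a.\text{start})\ op\ (l.\text{end} - l.\text{start})
\end{array}
$$

Figure 2: Grammar of Interests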

learnModel (posEs, negEs) {
    (posTrainEs, negTrainEs) = subsample(posEs, negEs);
    baseFns = callSketch(harness, posTrainEs, negTrainEs);
    preds = extractPredicates(baseFns);
    model = createSVMModel(preds, posTrainEs, negTrainEs);
    return model;
}

generateDecomposableModel (model) {
    supportVectorEs = getSupportVectors(model);
    decompModel = callSketch(supportVectorEs);
    return decompModel;
}

activeLearningRound (posEs, negEs, unratedEs,
                     numSamples) {
    model = learnModel(posEs, negEs);
    for (e in unratedEs)
        ratings[e] = computeRating(model, e);
    sortedEs = sortByAbsValue(ratings);
    decompModel = generateDecomposableModel(model);
    return (sortedEs[0:numSamples], decompModel);
}



Figure 3: Hybrid Algorithm 



With this in mind, we next discuss how the weak base learners 
are combined in the ensemble and hybrid approaches. 

4. THE HYBRID APPROACH 

In this section we discuss in detail our ensemble and hybrid ap- 
proaches, and provide insights into why the hybrid one performs 
better than the ensemble one. 

4.1 Hybrid Algorithm 

As mentioned in Sec. 2, our classifier works by first generating a
number of functions that are capable of fully explaining the training
data. Unfortunately, the ensemble approach does not provide
any generalization guarantees. However, as we pointed out later in
the same section, we can instead break the weak learners into their
constituent predicates and use
them as features to train an SVM classifier. Classification is then
done using the SVM, with an event classified as interesting if the SVM
returns a value > 0, and as not interesting otherwise. Figure 3 outlines
this hybrid algorithm in pseudocode form.

Learning begins by giving the sets of positively labeled (i.e., those
labeled as "interesting") and negatively labeled events to learnModel
on line 1, which first invokes the Sketch synthesizer to generate a
number of functions (the number to generate is a parameter to the
algorithm). The functions are then passed to extractPredicates on
line 4, which extracts and returns the set of base predicates from
each function (e.g., user = John). The base predicates are then
passed to the SVM to generate a model that returns a numerical
rating ranging from -1 to 1. The model is used in classification of
incoming events (not shown in Fig. 3), where the incoming event
is negatively labeled if the rating is less than 0, and is positively
labeled otherwise.

Then, during each round of active learning, activeLearningRound
on line 15 is called with the list of previously rated events, the list
of unrated events to choose from for subsequent user querying, and
the number of events to choose. It first constructs a model using
learnModel on line 17 with the list of previously rated events. Then,
for each unrated event, it asks the model to compute its (numerical)
rating; events are then sorted according to the absolute values
of their ratings, and the ones that are closest to 0 (i.e., the ones that
are the most uncertain according to the current model) are chosen to
query the user for labels. At the same time, generateDecomposableModel
on line 21 is called to create a model representation to drive subsequent
data acquisition. In our experiments the time taken to construct
models is typically short.
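The selection step of an active learning round can be sketched as below. The function name `select_queries` and the toy ratings are hypothetical; `rate` stands in for whatever trained model produces a numerical rating, which is assumed to lie between -1 and 1 as described above.

```python
from typing import Callable, Hashable, List, Sequence


def select_queries(unrated: Sequence[Hashable],
                   rate: Callable[[Hashable], float],
                   num_samples: int) -> List[Hashable]:
    """Return the num_samples unrated events whose ratings are closest to 0,
    i.e. the events the current model is least certain about."""
    ratings = {e: rate(e) for e in unrated}
    ranked = sorted(unrated, key=lambda e: abs(ratings[e]))
    return ranked[:num_samples]


# Toy ratings standing in for a trained model's output (assumed values).
toy_ratings = {"e1": 0.9, "e2": -0.05, "e3": 0.4, "e4": -0.7}
queries = select_queries(list(toy_ratings), toy_ratings.get, 2)
```

With these toy values the two events selected for labeling are `e2` and `e3`, whose ratings have the smallest absolute value.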

Noisy data might prevent Sketch from generating any candidate
function since the ratings might be contradictory. The subsample
function on line 2 is used as a means to remove contradicting inputs
prior to model training. Even though more sophisticated methods
can be used, our experiments have shown that the simple sampling
method is good enough to give reasonable performance in the presence
of noise.

In a sense, one can view the hybrid approach as using the Sketch 
synthesizer as a feature selection mechanism, and feeding the se- 
lected predicates into the SVM to build the resulting classifier. To 
test that view, we have implemented other standard feature selection
algorithms and provide comparisons in Sec. 5.

4.2 Generating Decomposable Models 

The output learned by a linear SVM is a model consisting of a
linear function over a selected set of predicates, together with a list
of all the input predicates and a weight for each of them. The
weights for each predicate are computed using standard method- 
ology from the weights the SVM assigns to each input event in- 
stance. Unfortunately such a model does not decompose well into 
per-device filters usable for further data acquisition. On the other 
hand, given the input training data, a program synthesizer is able 
to generate a classifier that is decomposable, but unfortunately pro- 
gram synthesizers do not provide any generalization guarantees. 
Fortunately, SVM is able to help us in that respect, since it already 
identifies the subset of the training data that is used to define the 
separating hyperplane, otherwise known as the support vectors, and 
in most cases, the number of support vectors is much smaller than 
the size of the entire training set, thanks to the SVM's regulariza- 
tion feature. Thus, as a post-processing step, we feed the events 
that are labeled as the support vectors to the synthesizer and ask it 
to generate a decomposable model. Even though the resulting clas- 
sifier generated by the synthesizer might not be exactly the same 
as the one generated by the SVM (for instance, it might pick predicates
that have low weights as assigned by the SVM, but nonetheless
can still classify the incoming events), in Sec. 5.3 we present
empirical evidence that our approach does indeed generate decom- 
posable models that are similar to what the user originally has in 
mind. 



5. EXPERIMENTS 

In this section we present our experimental results. The overall 
goal of the experiments is to compare various aspects of the differ- 
ent learners. 

5.1 Methods Compared 

We used the LifeJoin platform for experimental purposes. Life- 
Join collects events from two different event streams. One of them 
is about the location of users, with fields (user, location, start time, 
end time), and the other one about users' activities, with fields 
(user, activity, start time, end time). As mentioned in Sec. 1, we are
interested in composite events where events from the two streams 
can be combined in different ways, for instance joining them on the 
user fields, and part of the learner's goal is to learn how to combine 
the two event streams to generate interesting events. In the follow- 
ing we use locF as a shorthand for field F in the location event (and 
similarly for actF for the activity event), and duration is shorthand 
for the length of the corresponding event (i.e., end time - start time). 
To simplify the description we represent users and activities using 
numbers rather than actual names. We use the term unary features for
features that involve only one comparison operation, such as
locUser = 3, and conjunctive features for those that involve mul- 
tiple comparisons connected with conjunctions, such as (locUser = 
3 and activity = 4). In the rest of the section full refers to the set of 
all unary and conjunctive features together. 
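
The feature vocabulary described above can be sketched as follows. This is a minimal illustration that assumes only equality comparisons over three fields; the `DOMAINS` table and the helper names are hypothetical and are not taken from LifeJoin, and duration comparisons are omitted for brevity.

```python
from itertools import combinations

# Hypothetical field domains for a joined (location, activity) event.
DOMAINS = {
    "locUser": range(5),
    "location": range(5),
    "activity": range(5),
}

def unary_features(domains):
    """One equality comparison per (field, value) pair."""
    return [(field, value) for field, values in domains.items() for value in values]

def conjunctive_features(unary, max_size=2):
    """Conjunctions of unary features over distinct fields."""
    result = []
    for size in range(2, max_size + 1):
        for combo in combinations(unary, size):
            fields = {field for field, _ in combo}
            if len(fields) == size:  # skip contradictions such as locUser=1 and locUser=2
                result.append(combo)
    return result

def evaluate(feature, event):
    """A unary feature is (field, value); a conjunctive one is a tuple of those."""
    if isinstance(feature[0], str):
        field, value = feature
        return event[field] == value
    return all(event[field] == value for field, value in feature)

unary = unary_features(DOMAINS)
conjunctive = conjunctive_features(unary)
full = unary + conjunctive  # the "full" feature set in the sense used in this section
```

Even in this small setting the conjunctive set is several times larger than the unary set, which is why enumerating the full set becomes costly as the domain grows.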

For the evaluation we implemented eight different learners, as 
shown in Fig. 4. The LI and MI methods are both classical machine
learning approaches based on an SVM classifier. Here the LI
approach uses the LASSO algorithm for feature selection, followed 
by a linear SVM for classification. The MI approach performs fea- 
ture selection by computing the mutual information between each 
of the features and the output label, and picks the features with the 
highest scores for subsequent classification using a linear SVM. 
Both of these methods enumerate the full feature set on the training
data before feature selection. The ensemble learner represents
the program synthesis approach described in Sec. 3, and hybrid
represents our new hybrid approach described in Sec. 4.
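
As a rough illustration of the MI-style selection step, the sketch below scores binary features by their empirical mutual information with the label and keeps the top k. The function names and the dictionary-based feature matrix are assumptions made for this example; the subsequent linear SVM and the LASSO variant are not shown.

```python
import math
from collections import Counter

def mutual_information(feature_values, labels):
    """Empirical mutual information (in bits) between a feature and a label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    f_counts = Counter(feature_values)
    y_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_fy = count / n
        mi += p_fy * math.log2(p_fy / ((f_counts[f] / n) * (y_counts[y] / n)))
    return mi

def top_k_features(feature_matrix, labels, k):
    """feature_matrix maps a feature name to its list of 0/1 values, one per event."""
    scores = {name: mutual_information(col, labels) for name, col in feature_matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because the scores are estimated from counts, they become unreliable when only a handful of labeled events is available, which is the regime of the active learning experiments later in this section.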

Tree is the learner created by first learning a decision tree with
the C4.5 [19] algorithm in the Weka [11] toolkit, and then creating
features by extracting the path(s) from the root node that lead
to the leaves that classify the event as interesting, as in [25]. In or- 
der to avoid degenerate trees, we lowered the support for splitting 
and did not prune the generated tree. We have also experimented 
with random trees and the results are similar. The resulting features 
are then used to train an SVM for classification. 

Full is an SVM learner that uses no feature selection on the full
set of conjunctive features as mentioned above; unary is an SVM
learner that has no conjunctive features; and poly uses the same set
of non-conjunctive features as unary except that the features are
passed through a polynomial kernel. We did not consider other
types of kernels such as radial basis kernel as they combine the in- 
put features in a way that does not produce decomposable models 
(req. 1 from Sec. 1). For the learners that involve SVMs, we tuned
the parameters (e.g., amount of regularization) using cross-fold
validation, and we set the degree of the polynomial kernel to 6 after
trying all kernels of degree 2 to 8. We applied the polynomial ker- 
nel to other learners (full, LI, MI, ensemble, tree) as well, but that 
did not improve the results. 

5.2 Experiment Setup 

Figure 4: Description of the learners used

We generated a synthetic data set in which we modeled 5 users, each
randomly and uniformly selecting one of 5 locations to visit. Each
user remains at the location for a random period of time (ranging
from 1 to 10 hours), and randomly and uniformly selects one of 5
activities to perform at the location. This is meant to model the type
of input data that LifeJoin produces. The experiments were run on
a server with 32 cores and 30GB of RAM, and we executed them in
parallel as much as possible. We chose to evaluate our methods on
synthetic data rather than actual data since no large data set is
publicly available, and using synthetic data decouples us from the
potential errors in data collection or event identification on the
phones. The data set is generated randomly and does not favor or
disfavor any particular learner.
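
A generator in the spirit of this setup might look like the sketch below. The record layout, the 24-hour range for start times, and the range for activity durations are assumptions for illustration; the text specifies only the number of users, locations, and activities and the 1 to 10 hour stay at a location.

```python
import random

def generate_event(rng, n_users=5, n_locations=5, n_activities=5):
    """One joined (location, activity) record with hypothetical field names."""
    return {
        "locUser": rng.randrange(n_users),
        "location": rng.randrange(n_locations),
        "locStartTime": rng.uniform(0, 24),   # assumed: hour of day
        "locDuration": rng.uniform(1, 10),    # hours spent at the location
        "actUser": rng.randrange(n_users),
        "activity": rng.randrange(n_activities),
        "actStartTime": rng.uniform(0, 24),   # assumed: hour of day
        "actDuration": rng.uniform(1, 10),    # assumed: same range as the stay
    }

rng = random.Random(0)  # fixed seed so that runs are repeatable
events = [generate_event(rng) for _ in range(200)]
```

Passing larger values for `n_users` and `n_locations` gives a quick way to produce a bigger domain with the same generator.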

In addition to a data set, we need a way to generate user interests
(for labeling training data and for generating ground truth to
evaluate the performance of the different learners).
To do this, we manually created 6 different interest functions of
increasing complexity and used numerical values to represent users,
locations, and activities, as shown below.

1. locUser = actUser

2. (locUser = 3 ∧ locDuration > 1 ∧ activity = 0) ∨ (locUser =
∧ activity = 2)

3. (location = 3 ∧ locUser = actUser ∧ locDuration > 3) ∨
(location = 2 ∧ activity = 1 ∧ actDuration > 2) ∨ (locUser = 1 ∧
locDuration > 4)

4. same as 3. plus disjunct: (actUser = 3 ∧ activity = 2 ∧
actDuration > 1)

5. same as 4. plus disjunct: (actUser = 1 ∧ actDuration > 1)

6. same as 5. plus disjunct: (locUser = actUser ∧ actUser = 2 ∧
locStartTime - actStartTime < 2)
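
To make the labeling concrete, interest functions 1, 3, and 4 from the list above can be written as Boolean predicates over a joined event. The dictionary keys are hypothetical field names chosen to mirror the notation in the list; interest 2 is omitted because one of its constants is missing in the text.

```python
def interest_1(e):
    return e["locUser"] == e["actUser"]

def interest_3(e):
    return (
        (e["location"] == 3 and e["locUser"] == e["actUser"] and e["locDuration"] > 3)
        or (e["location"] == 2 and e["activity"] == 1 and e["actDuration"] > 2)
        or (e["locUser"] == 1 and e["locDuration"] > 4)
    )

def interest_4(e):
    # Interest 3 plus one extra disjunct.
    return interest_3(e) or (
        e["actUser"] == 3 and e["activity"] == 2 and e["actDuration"] > 1
    )

def label(event, interest):
    """Positive label (+1) when the interest function evaluates to true."""
    return 1 if interest(event) else -1
```

Each additional disjunct can only add positive events, so interest 4 labels a superset of the events that interest 3 does.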

Each of the interests above causes a different amount of class
imbalance in the input training data. For instance, the first interest
function labels about 40% of the events to be positive, whereas the 
last (most complicated) interest function labels only about 10% of 
the events as positive. This is to model how class distribution can 
vary drastically among different user interests. 

For all the experiments we allow Sketch to learn a maximum of
14 different interests, and allow each interest to consist of a
maximum of 7 different conjuncts. The numbers were picked from an
initial sampling of 5 users. Limiting to 7 conjuncts is
more than needed in order to learn the predicates listed above, but
we used that setting for two reasons. First, we believe this level
of interest complexity is a reasonable approximation of the maximum
complexity of interests a user might have. Second, for the
experiments below that contain errors in the training set, setting
the number of interests too low could result in Sketch not
being able to find a satisfying model.

Figure 5: Cross validation accuracies on error-free training data

Figure 6: Cross validation feature set sizes on error-free data
(poly and tree have the same # as unary)

To generate training data (and validate the performance of our 
learners), we labeled data points in our data set using each of these 
interest functions, assigning a positive label to the event for a given 
interest function if the interest function evaluates to true. 

5.3 Cross Validation Experiments 

In the first set of experiments, we evaluate the accuracies of the 
different schemes using cross validation. The goal of this experi- 
ment is to evaluate learner performance in the absence of any per- 
formance anomalies the active learning methods may introduce. 

For each of the predicates we first generated a dataset of 100
positively and 300 negatively labeled events. The events are
uniformly sampled from a domain consisting of 5 users, 5 locations,
and 5 different types of activities. We ran 10-fold cross validation
on the dataset, where we divide the positive and negative events into
10 partitions. Figure 5 shows the average accuracies and Fig. 6 shows
the number of features that are actually used for classification.
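
One plausible reading of the partitioning step is a stratified split, in which positives and negatives are dealt separately into the 10 folds so that each fold keeps the 1:3 class ratio. The sketch below assumes that reading; the function name and the round-robin assignment are illustrative choices.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Return k lists of event indices, each with the same class mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal indices round-robin
    return folds

labels = [1] * 100 + [-1] * 300
folds = stratified_folds(labels)
```

In each round of cross validation, one fold serves as the held-out set and the remaining nine are used for training.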

The results show that our hybrid learner has similar accuracy as 
compared to standard machine learning techniques. At the same 
time, it does not require using the full feature set as in the other 
learners such as LI or MI. This is particularly important when
comparing the number of features that are used for classification. To
achieve the same overall accuracy, the number of features used by
the hybrid and ensemble learners is an order of magnitude smaller
than for the others, as shown in Fig. 6.

Next, we repeated the same experiment, but this time we introduced
a 5% error into the training set. Here error refers to the
chance that a given event in the training set is mislabeled, i.e., an
event that is labeled as "interesting" is reversed to be
"uninteresting" and vice versa; the test set remains error-free. This
is to model human error or user interests drifting over time.
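
The error model described above amounts to flipping each training label independently with a fixed probability. A minimal sketch, assuming labels are encoded as +1 and -1 and using a hypothetical helper name:

```python
import random

def flip_labels(labels, error_rate, rng):
    """Flip each +1/-1 label independently with probability error_rate."""
    return [-y if rng.random() < error_rate else y for y in labels]

rng = random.Random(0)
clean = [1] * 100 + [-1] * 300
noisy = flip_labels(clean, 0.05, rng)  # applied to training labels only
```

The test labels are left untouched, so measured accuracy still reflects the true interest function.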

The results shown in Fig. 7 are similar to the case without errors,
except that the average accuracies of all the learners are lowered, 
as expected. It also took longer for the experiments to complete 
(about 1-2 hours per fold) due to the complexity introduced by the 
erroneous events. 

5.4 Active Learning Experiments 

In the next set of experiments we evaluate the learners in the
actual usage setting, where the user is asked to label a few new
data points each time she visits her newsfeed. At the end of each
round the learner is given the newly rated events along with the
previously rated ones to refine its model of the user. The goal
of the learner is to select the list of events to present in each round
so as to maximize the accuracy of the model, and to do so in as
few rounds as possible.

Figure 7: Cross validation accuracies on data with 5% error

5.4.1 Basic Setting 

For evaluation purposes we generate 100 positive and 300 nega- 
tive events as a training set to be presented during active learning. 
The events are generated using the same settings as in the cross val- 
idation experiments. We then generate an additional 10k events and 
ratings (which are not given to the learners) to use as the test set. 
The events in the test set are generated randomly without regards 
to the ratio of positive and negative events (about 10% - 40% of the 
test events are positive, depending on the interest function). Ini- 
tially, the learners are given 1 positive and 1 negative event to learn 
an initial model. Then, during each iteration, the learners choose 
5 events from the training pool to query for their ratings to rebuild 
the model. We measure the accuracy of the model at the end of 
each round for 20 rounds. Figure 9(a) shows the results on
predicate 6, averaged over 10 runs. The results for the other predicates
are similar but the learning rates tend to be higher for less complex 
predicates as explained next. 
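
The protocol above (seed with 1 positive and 1 negative event, then query 5 events per round for 20 rounds) can be sketched as a generic loop. Everything beyond those numbers is an assumption for illustration: the `train` and `select` callables are placeholders, and the toy threshold learner with closest-to-boundary selection merely stands in for the learners and query strategies compared in this paper.

```python
import random

def active_learning(pool, oracle, train, select, test_set, rounds=20, batch=5):
    """Return the test accuracy measured at the end of each round."""
    labels = [oracle(e) for e in pool]            # hidden until queried
    first_pos, first_neg = labels.index(1), labels.index(-1)
    labeled = [first_pos, first_neg]              # seed: 1 positive, 1 negative
    unlabeled = [i for i in range(len(pool)) if i not in (first_pos, first_neg)]
    model = train([(pool[i], labels[i]) for i in labeled])
    accuracies = []
    for _ in range(rounds):
        # select returns positions within the candidate list
        chosen = select(model, [pool[i] for i in unlabeled], batch)
        for position in sorted(chosen, reverse=True):
            labeled.append(unlabeled.pop(position))
        model = train([(pool[i], labels[i]) for i in labeled])
        correct = sum(model(e) == oracle(e) for e in test_set)
        accuracies.append(correct / len(test_set))
    return accuracies

# Toy stand-ins: events are numbers and the "interest" is x > 0.5.
def train_threshold(labeled):
    low = max(x for x, y in labeled if y == -1)
    high = min(x for x, y in labeled if y == 1)
    threshold = (low + high) / 2
    def model(x):
        return 1 if x > threshold else -1
    model.threshold = threshold
    return model

def select_closest(model, candidates, batch):
    order = sorted(range(len(candidates)),
                   key=lambda i: abs(candidates[i] - model.threshold))
    return order[:batch]

def oracle(x):
    return 1 if x > 0.5 else -1

rng = random.Random(0)
pool = [rng.random() for _ in range(400)]
test_set = [rng.random() for _ in range(1000)]
accuracies = active_learning(pool, oracle, train_threshold, select_closest, test_set)
```

Plotting `accuracies` against the round number gives a learning curve of the kind discussed next.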

The focus of these results is the learning rate, i.e., the rate at 
which the accuracy increases. As the results show, while the learn- 
ers that use classical feature selection mechanisms (LI and MI) do 
have higher learning rates as compared to those that do not (full 
and unary), our hybrid and ensemble learners have a significantly 
higher learning rate than any of the others, due to the fact that they 
are able to pick features with higher predictive power, as discussed 
in Sec. 4.

Figure 8 shows the number of features that are used for
classification in each round for the learners. While they all increase as
the number of rounds increases as expected, the growth rate for the 
hybrid and ensemble learners that use Sketch for feature selection 
is much slower than the others. 

As a note, we also experimented with skewed data, where the 
training and testing data are biased towards certain users and loca- 
tions (to model popular events, and thus the class imbalance is less 
severe), along with another experiment where we varied the num- 
ber of events to add per round of active learning. The results are 
similar to those from the basic setting. 

5.4.2 Effect of Errors 

Next, we introduce errors into the training events as in the cross
validation experiments and quantify the effect on accuracies. We
ran two sets of experiments, introducing 5% and 10% error
respectively. Fig. 9(b) shows the results of running
predicate 6 with 5% error, and Fig. 9(c) shows the results with 10%
error.

The average accuracy of all the learners is lowered versus the no
error case, as expected. Also as expected, the 10% error case is
worse than the 5% error case. However, in both cases, the hybrid
learner still performs better than the others. The number of features
used (not shown) exhibits a similar trend as in the no error case,
except that all learners end up using a larger number of features as
a result of the introduction of noise. This shows the power of our
approach: by not having an implicit assumption about the class
distribution of the training data, the hybrid learner performs better
than those that do.

Figure 8: Average feature set sizes used by learners on error-free
data (poly and tree have the same # of features as unary)

5.4.3 Making Use of Extended Labels 

One of the advantages of the hybrid learner over the ensemble
learner is that the SVM in the hybrid learner is able to make use
of extended labels. This is because extended labels simply change
the problem from classification to regression, where instead of a
binary label (e.g., "like" or "dislike"), the goal is to predict ratings
on a continuous scale from -1 to +1. In this experiment, we repeat
the same experiment as in the basic setting but with extended labels
for events. For events that are of interest, the label remains +1
as before. For those that are not of interest, the label is negative,
but its value is computed in the following way. Given the user's
interest expressed as N disjuncts $\bigvee_i d_i$, where each $d_i$ is
a conjunction of predicates, then if the event e fails all disjuncts,
the value of its label is computed as
$\min_i(\#\mathrm{failed}(d_i, e)/\#(d_i, e))$, where
$\#\mathrm{failed}(d_i, e)$ is the number of predicates that e has
failed within $d_i$, and $\#(d_i, e)$ is the total number of conjuncts
in $d_i$. We chose
to pick the minimum since this represents the minimal number of
changes in e that would make the user happy. We present the
accuracy results in Fig. 10 for running on predicate 6, and they show
that the learning rate for the hybrid-regression learner is faster as
compared to the ensemble and original hybrid-binary learners. This
makes sense since the regression learner is able to make use of the
extended information that is embedded within the "near miss" cases
in selecting better samples during each round of active learning.
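
The extended label computation can be sketched as below. The representation of an interest as a list of conjunctions, each a list of predicate functions, is an assumption for illustration, as is the explicit negation of the minimum: the text states that the label of an uninteresting event is negative, so the sketch returns the negated fraction.

```python
def extended_label(event, disjuncts):
    """disjuncts: list of conjunctions; each conjunction is a list of predicates."""
    fractions = []
    for conjunction in disjuncts:
        failed = sum(1 for predicate in conjunction if not predicate(event))
        if failed == 0:
            return 1.0                 # some disjunct holds: event is of interest
        fractions.append(failed / len(conjunction))
    # Assumed sign convention: "near misses" get labels close to 0,
    # events far from every disjunct get labels close to -1.
    return -min(fractions)

# Hypothetical interest: (location = 3 and locDuration > 3) or (locUser = 1)
interest = [
    [lambda e: e["location"] == 3, lambda e: e["locDuration"] > 3],
    [lambda e: e["locUser"] == 1],
]
near_miss = {"location": 3, "locDuration": 1, "locUser": 2}
```

Here `near_miss` fails one of two predicates in the first disjunct and the only predicate in the second, so the minimum failed fraction is 1/2.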

5.4.4 Large Domain 

In the next experiment we increase the number of users and the
number of locations from 5 to 50, and the number of activities from
5 to 10. This is to model a user who has more friends and visits
more locations. We generated the 100 positive and 300 negative
training events from the new domain using uniform sampling as
before, and an additional 10k events for the test set. We execute
the same active learning experiment as before. Figure 11 shows the
results.

2 Since we are learning a separate model for each user, we do not need to
scale up to, say 1M for the number of users or locations as it is unlikely that
a given user would have that many friends or locations traveled.

Figure 9: Average accuracies of learners using (a) error-free, (b) 5% error, and (c) 10% error data.

Figure 10: Extended label experiment results

At the outset, it seems that all learners achieve high overall
accuracies on the test data, but closer examination shows that this is
not the case. In particular, unlike previous experiments where the ratio
of positively and negatively rated events is not heavily skewed, in
this case, due to the large domain size, only around 3% of the events
in the test set are positively rated, so the learners quickly learn to
assign negative to most test events in order to maximize overall
accuracy. The result is a model with high precision on the negatively
rated events and very low precision on the positively rated ones.
The decision tree based classifier, however, instead generalizes
on the positively labeled events and classifies almost all events
as interesting. As a result, it achieves high accuracy on the positive
events and low accuracy on the negative ones, resulting in low overall
accuracy. To illustrate this, Fig. 12 shows the accuracy results on just
the positive events. The figures show that even though the overall
accuracies of the learners are comparable, the hybrid and ensemble
approaches actually perform much better than the other learners on
the positive events.
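
The effect described above is easy to reproduce with a small calculation. The sketch below reports overall accuracy alongside accuracy restricted to the positive and to the negative events; the helper name and the 300/9700 split (roughly 3% positives among 10k test events) are illustrative.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (overall, accuracy on positives, accuracy on negatives)."""
    def accuracy_on(cls):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        if not indices:
            return float("nan")
        return sum(y_pred[i] == cls for i in indices) / len(indices)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return overall, accuracy_on(1), accuracy_on(-1)

y_true = [1] * 300 + [-1] * 9700      # about 3% positive
always_negative = [-1] * 10000        # a model that rejects every event
overall, on_pos, on_neg = per_class_accuracy(y_true, always_negative)
```

A model that rejects everything reaches 97% overall accuracy here while finding none of the events the user cares about, which is why accuracy on the positive events is reported separately.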

This experiment raises two important points when comparing
among the learners. First, all of the learners except for hybrid,
ensemble, poly, and tree require enumeration of the full feature set
for all events. In this large domain case, this takes a substantial
amount of time (2 hours for conversion into the feature
representation) and disk space (300 MB needed to encode 10k events), as
compared to the synthesis-based feature selection approach used
in the hybrid and ensemble learners, which takes much less time
(10 min to finish the Sketch runs and seconds to convert the chosen
features into feature-space representation) and negligible disk
space (600 kB to encode 10k events). Second, the fact that the
classical machine learning based learners assign negative labels to
most events means that they will very likely not be able to identify
any interesting events for the user, which is the ultimate goal.

Figure 13: Model explanations using Sketch

5.5 Model Explanation Experiments 

In these experiments we test the effectiveness of using a program
synthesizer at producing a decomposable model (which will
provide human readability and the ability to be pushed down onto
a phone for data acquisition purposes). As discussed in Sec. 4.2,
we took the support vectors after model generation and fed them
into Sketch. In an attempt to generate a minimal description of the
model, we ran Sketch iteratively, first assuming the user has only 1
interest and asking Sketch to generate a description of the model.
If that failed, we increased the number of interests until Sketch was
able to find a description. We took the data from one of the cross
validation experiments without error consisting of 400 events. Figure 13
lists the number of iterations needed for each of the predicates to
produce the model description, the number of support vectors used
as inputs, along with the actual description generated.
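
The iterative search for a minimal description can be sketched as a simple loop around the synthesizer. In the sketch below, `synthesize` is a placeholder callable and `toy_synthesize` is a deliberately trivial stand-in written for this example; neither reflects the actual Sketch interface, and the limit of 14 interests is taken from the setup in Sec. 5.2.

```python
def minimal_description(support_vectors, synthesize, max_interests=14):
    """Try 1, 2, ... interests until the synthesizer returns a model."""
    for n_interests in range(1, max_interests + 1):
        model = synthesize(support_vectors, n_interests)
        if model is not None:
            return n_interests, model
    return None  # no description found within the limit

def toy_synthesize(examples, n_interests):
    """Stand-in: cover the positives with at most n equality tests on 'location'."""
    positives = {e["location"] for e, y in examples if y == 1}
    negatives = {e["location"] for e, y in examples if y == -1}
    if len(positives) > n_interests or positives & negatives:
        return None
    return sorted(positives)

support_vectors = [
    ({"location": 2}, 1),
    ({"location": 3}, 1),
    ({"location": 1}, -1),
]
result = minimal_description(support_vectors, toy_synthesize)
```

Starting from a single interest and growing the bound only on failure means the first description found uses the fewest interests the synthesizer can manage.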

Although the learned predicates do not perfectly match with the 
predicates used to generate the labels for the data, they are quite 
similar, and make it relatively easy to determine what data to
subsequently collect on the phones. The results also show the power of
using SVM to reduce the number of input events that are needed to 
feed into the synthesizer, where in the best case (interest 1) we only 
need to give 5% of the original training events in order to generate a 
decomposable description that also happens to perfectly match the 
original interest function. 

6. RELATED WORK 

Recently, many probabilistic modeling approaches have been pro- 
posed that can also be applied to the learning problem discussed in 
this paper, including Bayesian networks |1 1 1 . statistical relational 
learning [8], and probabilistic logic [17]. There is also work on
building probabilistic models predicting user behavior [27, 12, 2,
15, 6]. However, as with SVMs, such techniques tend not to
generate decomposable models.

On the other hand, other inductive learning techniques, such as
inductive logic programming [18, 16], which aim to learn formulas
from the training data, can produce decomposable models. However,
such tools still assume the input data to have a certain class
distribution, and it is unclear how feature selection can be done for
such techniques.

There are many feature selection algorithms that have been proposed
in addition to mutual information and LASSO. However, our
synthesis-based approach differs from these techniques in
that most of them focus on the statistical properties
of the training data, e.g., approximating the probability
distribution of a feature based on the number of data points in the
training set in which it appears, as in the MI metric. Such schemes
perform well when fed with a sufficiently large amount of training
data, as is evident in our cross validation experiments, but do not do
so well when the training data size is small, as in our active
learning scenarios.

In recent years, the programming languages community has been 
working on programming-by-example problems to synthesize different
types of programs [21, 10, 9]. Our work differs from previous
tools in that we require a feature selection mechanism in place
in order to provide reasonable results. The work of Gulwani in [9]
proposes querying the user to provide differentiating outputs when
the synthesizer cannot decide between multiple programs that satisfy
the same input constraints. Similar ideas appeared in [10]. We
generalize this concept and propose the ensemble learning scheme, 
and further show that a hybrid scheme that combines synthesis- 
based feature selection with an SVM for classification can provide 
excellent performance for social networking applications like Life- 
Join. 

7. CONCLUSIONS 

In this paper, we presented a learning algorithm that combines 
the strengths of classical machine learning techniques with pro- 
gram synthesis tools, focusing on personalized social recommen- 
dation applications. We showed that a hybrid approach, which 
first uses program synthesis to generate base learners, followed by 
breaking down into individual features and weight assignment with 
an SVM, significantly improves runtime and classification accu- 
racy. Finally, we showed that using program synthesis on the out- 
put of an SVM can yield much simpler, and more human readable 
models, which help users understand system behavior and can drive 
subsequent data collection. 

The experiments show that the hybrid approach can significantly 
outperform traditional classification schemes on synthetic data, but 
an important next step is to validate the results on real-world data. 
Similarly, more research is needed in analyzing the generalization 
properties of the synthesis-based approach. Understanding its
theoretical connections with classical machine learning-based
techniques will help develop further algorithms that leverage the
advantages of the two in improving results.